









- The essential idea of floating-point representation is that a fixed number of bits is used (usually 32 or 64) and the binary point "floats" to wherever it is needed. Some of the bits record where the binary point lies, so the programmer does not need to keep track of it explicitly.
- The IEEE (Institute of Electrical and Electronics Engineers) created a standard for floating point: the IEEE 754 standard, released in 1985 and updated in 2008. Virtually all mainstream hardware and software follows this standard.
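
To make the idea of "bits that say where the binary point lies" concrete, here is a minimal sketch that splits a 64-bit float into its bit fields. It assumes the IEEE 754 binary64 layout (1 sign bit, 11 exponent bits with a bias of 1023, 52 fraction bits), which these notes do not spell out; the function name `decompose_double` is hypothetical.

```python
import struct


def decompose_double(x: float) -> tuple[int, int, int]:
    """Split a 64-bit float into (sign, biased exponent, fraction) bit fields."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # 1 bit
    exponent = (bits >> 52) & 0x7FF     # 11 bits, stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)   # 52 bits after the implicit leading 1
    return sign, exponent, fraction


# -6.5 is -1.101 (binary) x 2^2, so the stored exponent is 2 + 1023 = 1025
# and the fraction field starts with the bits 101.
sign, exponent, fraction = decompose_double(-6.5)
print(sign, exponent - 1023, bin(fraction >> 49))
```

The exponent field is what moves the binary point: the same fraction bits with a different exponent represent a value scaled by a power of two.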































## **Accurate Arithmetic**

- The IEEE 754 Standard specifies additional rounding control
  - Extra bits of precision (guard, round, sticky).
  - Choice of rounding modes.
  - Allows programmer to fine-tune numerical behavior of a computation.
- Not all FP units implement all options
  - Most programming languages and FP libraries just use defaults.
- Trade-off between hardware complexity, performance, and market requirements.
- Rounding (except for truncation) requires the hardware to include extra bits during calculations
  - The guard bit provides one extra bit when shifting left to normalize a result (e.g., when normalizing after division or subtraction).
  - The round bit improves rounding accuracy.
  - The sticky bit supports round to nearest even; it is set to 1 whenever a 1 bit is shifted right through it (e.g., when aligning operands during addition/subtraction).
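
The role of the three extra bits can be sketched in software. The example below is only an illustration, not how any particular FP unit is built: it shifts an integer significand right (as when aligning operands), captures guard, round, and sticky bits, and then rounds under a chosen mode. The function names and the mode strings are hypothetical, and a real unit would also renormalize if rounding carries out of the top bit.

```python
def shift_right_with_grs(sig: int, shift: int) -> tuple[int, int, int, int]:
    """Shift sig right by `shift` bits; return (kept, guard, round, sticky)."""
    kept = sig >> shift
    # Guard is the first bit shifted out, round is the second.
    guard = (sig >> (shift - 1)) & 1 if shift >= 1 else 0
    round_bit = (sig >> (shift - 2)) & 1 if shift >= 2 else 0
    # Sticky is the OR of every bit shifted out below the round bit.
    sticky = 1 if shift >= 3 and (sig & ((1 << (shift - 2)) - 1)) else 0
    return kept, guard, round_bit, sticky


def round_significand(kept, guard, round_bit, sticky,
                      mode="nearest_even", negative=False):
    """Round the magnitude `kept` using the extra bits and a rounding mode."""
    inexact = bool(guard or round_bit or sticky)
    if mode == "toward_zero":          # truncation: extra bits are ignored
        increment = False
    elif mode == "toward_positive":
        increment = inexact and not negative
    elif mode == "toward_negative":
        increment = inexact and negative
    elif mode == "nearest_even":
        above_half = guard and (round_bit or sticky)
        tie = guard and not (round_bit or sticky)
        # On an exact tie, round so that the last kept bit becomes 0.
        increment = bool(above_half or (tie and kept & 1))
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return kept + 1 if increment else kept


# 0b1011_100 is an exact tie between 11 and 12; nearest even picks 12.
print(round_significand(*shift_right_with_grs(0b1011100, 3)))
# 0b1010_101 is just above the halfway point, so the sticky bit forces 11.
print(round_significand(*shift_right_with_grs(0b1010101, 3)))
```

Without the sticky bit, the second case would look identical to a tie and would round down to the even value 10, which is why truncation is the only mode that needs no extra bits.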

| Data transfer   | Arithmetic        | Compare      | Transcendental |
|-----------------|-------------------|--------------|----------------|
| FILD mem/ST(i)  | FIADDP mem/ST(i)  | FICOMP       | FPATAN         |
| FISTP mem/ST(i) | FISUBRP mem/ST(i) | FIUCOMP      | F2XM1          |
| FLDPI           | FIMULP mem/ST(i)  | FSTSW AX/mem | FCOS           |
| FLD1            | FIDIVRP mem/ST(i) |              | FPTAN          |
| FLDZ            | FSQRT             |              | FPREM          |
|                 | FABS              |              | FSIN           |
|                 | FRNDINT           |              | FYL2X          |

- Optional variations
  - I: integer operand.
  - P: pop operand from stack.



- ALUs are typically designed to perform 64-bit or 128-bit arithmetic.
- Some data types are much smaller, e.g., bytes for pixel RGB values, half-words for audio samples.
- Partitioning the carry chains within the ALU can convert a 64-bit adder into four 16-bit adders or eight 8-bit adders.
- A single load can fetch multiple values, and a single add instruction can perform multiple parallel additions, referred to as subword parallelism.
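
The effect of a partitioned carry chain can be imitated in software. The sketch below is an emulation under stated assumptions, not the hardware mechanism itself: it packs eight 8-bit values into one 64-bit word and adds all eight lanes with a single integer addition, using masks so that no carry crosses a lane boundary. The helper names (`pack_bytes`, `unpack_bytes`, `add_bytes_packed`) are hypothetical.

```python
HIGH_BITS = 0x8080808080808080  # top bit of each 8-bit lane
LOW_BITS = 0x7F7F7F7F7F7F7F7F   # low 7 bits of each 8-bit lane


def pack_bytes(values: list[int]) -> int:
    """Pack eight values in 0..255 into one 64-bit word (lane 0 lowest)."""
    return int.from_bytes(bytes(values), "little")


def unpack_bytes(word: int) -> list[int]:
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return list(word.to_bytes(8, "little"))


def add_bytes_packed(a: int, b: int) -> int:
    """Eight independent 8-bit additions (each modulo 256) in one word."""
    # Adding only the low 7 bits of each lane cannot carry into the next lane.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold in the top bit of each lane with XOR, discarding its carry-out.
    return low_sum ^ ((a ^ b) & HIGH_BITS)


a = pack_bytes([250, 1, 100, 200, 0, 255, 128, 127])
b = pack_bytes([10, 2, 100, 100, 0, 1, 128, 1])
print(unpack_bytes(add_bytes_packed(a, b)))  # each lane wraps around on overflow
```

Hardware with a partitioned adder gets the same result without the masking, simply by not propagating the carry between lanes.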

